Building Simple Models: A Case Study with Decision Trees
Authors
Abstract
1 Introduction

Many induction algorithms construct models with unnecessary structure. These models contain components that do not improve accuracy and that only reflect random variation in a single data sample. Such models are less efficient to store and use than their correctly-sized counterparts. Using these models requires the collection of unnecessary data. Portions of these models are wrong and mislead users. Finally, excess structure can reduce the accuracy of induced models on new data [8].

For induction algorithms that build decision trees [1, 7, 10], pruning is a common approach to removing excess structure. Pruning methods take an induced tree, examine individual subtrees, and remove those subtrees deemed unnecessary. Pruning methods differ primarily in the criterion used to judge subtrees. Many criteria have been proposed, including statistical significance tests [10], corrected error estimates [7], and minimum description length calculations [9].

In this paper, we bring together three threads of our research on excess structure and decision tree pruning. First, we show that several common methods for pruning decision trees still retain excess structure. Second, we explain this phenomenon in terms of statistical decision making with incorrect reference distributions. Third, we present a method that adjusts for incorrect reference distributions, and we present an experiment that evaluates the method. Our analysis indicates that many existing techniques for building decision trees fail to consider the statistical implications of examining many possible subtrees. We show how a simple adjustment can allow such systems to make valid statistical inferences in this specific situation.

X. Liu, P. Cohen, M. Berthold (Eds.): "Advances in Intelligent Data Analysis" (IDA-97), LNCS 1280, pp. 211-222, 1997.
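The statistical cost of examining many candidate subtrees can be illustrated with a short calculation. The sketch below is an assumption-laden illustration, not the adjustment developed in the paper: it treats the k candidate tests as independent, and it uses the Bonferroni correction as one standard textbook adjustment. The function names `familywise_error` and `bonferroni_alpha` are hypothetical.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability that at least one of k independent tests, each run at
    significance level alpha, rejects a true null hypothesis."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test level that keeps the familywise error at or below alpha
    (Bonferroni correction; an assumed stand-in for 'a simple adjustment')."""
    return alpha / k


if __name__ == "__main__":
    alpha = 0.05
    for k in (1, 10, 50):
        unadjusted = familywise_error(alpha, k)
        adjusted = familywise_error(bonferroni_alpha(alpha, k), k)
        print(f"k={k:3d}  unadjusted={unadjusted:.3f}  adjusted={adjusted:.3f}")
```

Under these assumptions, a criterion that looks acceptable for a single subtree becomes far too permissive when the best of many candidates is selected, which is the sense in which the reference distribution is incorrect.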
© Springer-Verlag Berlin Heidelberg 1997
JENSEN, GATES, AND COHEN

2 Observing Excess Structure

Consider Figure 1, which shows a typical plot of tree size and accuracy as a function of training set size for the UCI australian dataset.¹ Moving from left to right in the graph corresponds to increasing the number of training instances available to the tree building process. On the left-hand side, no training instances are available and the best one can do with test instances is to assign them a class label at random. On the right-hand side, the entire dataset (excluding test instances) is available to the tree building process. C4.5 [7] and error-based pruning (the C4.5 default) are used to build and prune trees, respectively.

Note that accuracy on this dataset stops increasing at a rather small training set size, thereafter remaining essentially constant.² Surprisingly, tree size continues to grow nearly linearly despite the use of error-based pruning. The graph clearly shows that unnecessary structure is retained, and more is retained as the size of the training set increases. Accuracy stops increasing after only 25% of the available training instances are seen. The tree at that point contains 22 nodes. When 100% of the available training instances are used in tree construction, the resulting tree contains 64 nodes. Despite a 3-fold increase in size over the tree built with 25% of the data, the accuracies of the two trees are statistically indistinguishable.

Under a broad range of circumstances, there is a nearly linear relationship between training set size and tree size, even after accuracy has ceased to increase. The relationship between training set size and tree size was explored with 4 pruning methods and 19 datasets taken from the UCI repository.³ The pruning methods are error-based (EBP, the C4.5 default) [7], reduced error (REP) [8], minimum description length (MDL) [9], and cost-complexity with the 1SE rule (CCP) [1].
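The size comparison reported above for the australian dataset can be worked through explicitly. The sketch below uses only the two figures given in the text (22 nodes at 25% of the training data, 64 nodes at 100%). The `TreeSnapshot` type and helper names are hypothetical, and counting every node beyond the 22-node tree as "excess" is an assumption that holds only if the two trees are equally accurate, as the text reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSnapshot:
    train_fraction: float  # fraction of available training instances used
    nodes: int             # number of nodes in the pruned tree


def growth_factor(small: TreeSnapshot, large: TreeSnapshot) -> float:
    """How many times larger the second tree is than the first."""
    return large.nodes / small.nodes


def excess_nodes(small: TreeSnapshot, large: TreeSnapshot) -> int:
    """Nodes added after accuracy stopped improving (assumed to be excess)."""
    return large.nodes - small.nodes


def excess_share(small: TreeSnapshot, large: TreeSnapshot) -> float:
    """Fraction of the larger tree made up of those added nodes."""
    return excess_nodes(small, large) / large.nodes


plateau = TreeSnapshot(train_fraction=0.25, nodes=22)  # figures from the text
full = TreeSnapshot(train_fraction=1.00, nodes=64)     # figures from the text

if __name__ == "__main__":
    print(f"growth factor: {growth_factor(plateau, full):.2f}")
    print(f"excess nodes:  {excess_nodes(plateau, full)}")
    print(f"excess share:  {excess_share(plateau, full):.1%}")
```

On these figures the full tree is about 2.9 times the size of the plateau tree, and roughly two thirds of its nodes were added without any measurable gain in accuracy.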
The majority of extant pruning methods take one of four general approaches: deflating accuracy estimates based on the training set (e.g., EBP); pruning based on accuracy estimates from a pruning set (e.g., REP); managing the tradeoff between accuracy and complexity (e.g., MDL); and creating a set of pruned trees based on different values of a pruning parameter and then selecting the appropriate parameter value using a pruning set or cross-validation (e.g., CCP). The pruning methods used in this paper were selected to be representative of these four approaches.

Plots of tree size and accuracy as a function of training set size were generated for each combination of dataset and pruning algorithm as follows. Typically,

¹ All datasets in this paper can be obtained from the University of California-Irvine (UCI) Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
² All reported accuracy figures in this paper are based on separate test sets, distinct from any data used for training.
³ The datasets are the same ones used in [4] with two exceptions. The crx dataset was omitted because it is roughly the same as the australian dataset, and the horse-colic dataset was omitted because it was unclear which attribute was used as the class label. Note that the vote1 dataset was created by removing the physician-fee-freeze attribute from the vote dataset.
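Of the four approaches listed above, pruning against a held-out pruning set (REP) is the simplest to sketch. The following is a minimal illustration of the general reduced error pruning idea, not the implementation used in the paper: the `Node` class, the toy tree, and the tie-breaking rule (prune when the leaf is no worse than the subtree) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

Example = Tuple[dict, str]  # (attribute values, class label)


@dataclass
class Node:
    label: str                                  # majority class at this node
    attr: Optional[str] = None                  # attribute tested; None for a leaf
    children: Optional[Dict[object, "Node"]] = None

    def is_leaf(self) -> bool:
        return self.attr is None


def predict(node: Node, x: dict) -> str:
    while not node.is_leaf():
        child = node.children.get(x.get(node.attr))
        if child is None:            # unseen value: use this node's majority label
            return node.label
        node = child
    return node.label


def errors(node: Node, data: Sequence[Example]) -> int:
    return sum(predict(node, x) != y for x, y in data)


def size(node: Node) -> int:
    if node.is_leaf():
        return 1
    return 1 + sum(size(c) for c in node.children.values())


def reduced_error_prune(node: Node, pruning_set: Sequence[Example]) -> Node:
    """Bottom-up: replace a subtree by a leaf whenever the leaf makes no more
    errors on the pruning instances that reach it (ties favor the leaf)."""
    if node.is_leaf():
        return node
    pruned_children = {}
    for value, child in node.children.items():
        reaching = [(x, y) for x, y in pruning_set if x.get(node.attr) == value]
        pruned_children[value] = reduced_error_prune(child, reaching)
    subtree = Node(node.label, node.attr, pruned_children)
    leaf = Node(node.label)
    return leaf if errors(leaf, pruning_set) <= errors(subtree, pruning_set) else subtree


# Toy tree: the split on "b" under a=1 reflects noise in the training sample.
tree = Node("no", "a", {
    0: Node("no"),
    1: Node("yes", "b", {0: Node("yes"), 1: Node("no")}),
})
pruning_set = [
    ({"a": 0, "b": 0}, "no"), ({"a": 0, "b": 1}, "no"),
    ({"a": 1, "b": 0}, "yes"), ({"a": 1, "b": 1}, "yes"),
]
pruned = reduced_error_prune(tree, pruning_set)
```

In this toy case the split on `b` is removed because a single leaf does better on the pruning instances that reach it, while the split on `a` is kept because removing it would introduce errors.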